Dow Jones Industrial Average (DJIA) prediction

We will predict the DJIA closing value from the top 25 news headlines of each day.

The Dow Jones Industrial Average, Dow Jones, or simply the Dow, is a stock market index that measures the stock performance of 30 large companies listed on stock exchanges in the United States.


Step 1 - Loading the libraries

In [6]:
import time
start = time.time()
import os

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import cross_val_score #cross-validation utilities from scikit-learn
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from nltk.sentiment.vader import SentimentIntensityAnalyzer #sentiment analysis 
from textblob import TextBlob #processing textual data

#plotting
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt

#statistics & econometrics
import statsmodels.tsa.api as smt
import statsmodels.api as sm

#model fitting and selection
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from xgboost.sklearn import XGBRegressor

Step 2 - Loading the data

In [7]:
df = pd.read_csv("Combined_News_DJIA.csv", sep=',',low_memory=False,
                    parse_dates=[0])

full_stock = pd.read_csv("upload_DJIA_table.csv",low_memory=False,
                    parse_dates=[0])

#add the closing stock value to the df - this will be the y variable
df["Close"]=full_stock.Close

#show how the dataset looks like
df.head(5)
Out[7]:
Date Label Top1 Top2 Top3 Top4 Top5 Top6 Top7 Top8 ... Top17 Top18 Top19 Top20 Top21 Top22 Top23 Top24 Top25 Close
0 2008-08-08 0 b"Georgia 'downs two Russian warplanes' as cou... b'BREAKING: Musharraf to be impeached.' b'Russia Today: Columns of troops roll into So... b'Russian tanks are moving towards the capital... b"Afghan children raped with 'impunity,' U.N. ... b'150 Russian tanks have entered South Ossetia... b"Breaking: Georgia invades South Ossetia, Rus... b"The 'enemy combatent' trials are nothing but... ... b'Al-Qaeda Faces Islamist Backlash' b'Condoleezza Rice: "The US would not act to p... b'This is a busy day: The European Union has ... b"Georgia will withdraw 1,000 soldiers from Ir... b'Why the Pentagon Thinks Attacking Iran is a ... b'Caucasus in crisis: Georgia invades South Os... b'Indian shoe manufactory - And again in a se... b'Visitors Suffering from Mental Illnesses Ban... b"No Help for Mexico's Kidnapping Surge" 17949.369141
1 2008-08-11 1 b'Why wont America and Nato help us? If they w... b'Bush puts foot down on Georgian conflict' b"Jewish Georgian minister: Thanks to Israeli ... b'Georgian army flees in disarray as Russians ... b"Olympic opening ceremony fireworks 'faked'" b'What were the Mossad with fraudulent New Zea... b'Russia angered by Israeli military sale to G... b'An American citizen living in S.Ossetia blam... ... b'"Do not believe TV, neither Russian nor Geor... b'Riots are still going on in Montreal (Canada... b'China to overtake US as largest manufacturer' b'War in South Ossetia [PICS]' b'Israeli Physicians Group Condemns State Tort... b' Russia has just beaten the United States ov... b'Perhaps *the* question about the Georgia - R... b'Russia is so much better at war' b"So this is what it's come to: trading sex fo... 17929.990234
2 2008-08-12 0 b'Remember that adorable 9-year-old who sang a... b"Russia 'ends Georgia operation'" b'"If we had no sexual harassment we would hav... b"Al-Qa'eda is losing support in Iraq because ... b'Ceasefire in Georgia: Putin Outmaneuvers the... b'Why Microsoft and Intel tried to kill the XO... b'Stratfor: The Russo-Georgian War and the Bal... b"I'm Trying to Get a Sense of This Whole Geor... ... b'Why Russias response to Georgia was right' b'Gorbachev accuses U.S. of making a "serious ... b'Russia, Georgia, and NATO: Cold War Two' b'Remember that adorable 62-year-old who led y... b'War in Georgia: The Israeli connection' b'All signs point to the US encouraging Georgi... b'Christopher King argues that the US and NATO... b'America: The New Mexico?' b"BBC NEWS | Asia-Pacific | Extinction 'by man... 17694.679688
3 2008-08-13 0 b' U.S. refuses Israel weapons to attack Iran:... b"When the president ordered to attack Tskhinv... b' Israel clears troops who killed Reuters cam... b'Britain\'s policy of being tough on drugs is... b'Body of 14 year old found in trunk; Latest (... b'China has moved 10 *million* quake survivors... b"Bush announces Operation Get All Up In Russi... b'Russian forces sink Georgian ships ' ... b'US humanitarian missions soon in Georgia - i... b"Georgia's DDOS came from US sources" b'Russian convoy heads into Georgia, violating... b'Israeli defence minister: US against strike ... b'Gorbachev: We Had No Choice' b'Witness: Russian forces head towards Tbilisi... b' Quarter of Russians blame U.S. for conflict... b'Georgian president says US military will ta... b'2006: Nobel laureate Aleksander Solzhenitsyn... 17409.720703
4 2008-08-14 1 b'All the experts admit that we should legalis... b'War in South Osetia - 89 pictures made by a ... b'Swedish wrestler Ara Abrahamian throws away ... b'Russia exaggerated the death toll in South O... b'Missile That Killed 9 Inside Pakistan May Ha... b"Rushdie Condemns Random House's Refusal to P... b'Poland and US agree to missle defense deal. ... b'Will the Russians conquer Tblisi? Bet on it,... ... b"Georgia confict could set back Russia's US r... b'War in the Caucasus is as much the product o... b'"Non-media" photos of South Ossetia/Georgia ... b'Georgian TV reporter shot by Russian sniper ... b'Saudi Arabia: Mother moves to block child ma... b'Taliban wages war on humanitarian aid workers' b'Russia: World "can forget about" Georgia\'s... b'Darfur rebels accuse Sudan of mounting major... b'Philippines : Peace Advocate say Muslims nee... 17140.240234

5 rows × 28 columns
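
Note that assigning full_stock.Close by row position assumes both CSV files list the dates in the same order. A safer alternative, sketched here under the assumption that the price table's date column is also named Date, is to merge the two frames on the date:

#Hedged alternative to the positional assignment above: align prices by Date,
#so the join stays correct even if the two CSVs are sorted differently.
df = df.drop(columns=["Close"], errors="ignore").merge(
    full_stock[["Date", "Close"]], on="Date", how="left")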

In [8]:
#drop the label column
df2 = df.drop(['Label'], axis=1)

#view the new dataframe
df2.head(10)
Out[8]:
Date Top1 Top2 Top3 Top4 Top5 Top6 Top7 Top8 Top9 ... Top17 Top18 Top19 Top20 Top21 Top22 Top23 Top24 Top25 Close
0 2008-08-08 b"Georgia 'downs two Russian warplanes' as cou... b'BREAKING: Musharraf to be impeached.' b'Russia Today: Columns of troops roll into So... b'Russian tanks are moving towards the capital... b"Afghan children raped with 'impunity,' U.N. ... b'150 Russian tanks have entered South Ossetia... b"Breaking: Georgia invades South Ossetia, Rus... b"The 'enemy combatent' trials are nothing but... b'Georgian troops retreat from S. Osettain cap... ... b'Al-Qaeda Faces Islamist Backlash' b'Condoleezza Rice: "The US would not act to p... b'This is a busy day: The European Union has ... b"Georgia will withdraw 1,000 soldiers from Ir... b'Why the Pentagon Thinks Attacking Iran is a ... b'Caucasus in crisis: Georgia invades South Os... b'Indian shoe manufactory - And again in a se... b'Visitors Suffering from Mental Illnesses Ban... b"No Help for Mexico's Kidnapping Surge" 17949.369141
1 2008-08-11 b'Why wont America and Nato help us? If they w... b'Bush puts foot down on Georgian conflict' b"Jewish Georgian minister: Thanks to Israeli ... b'Georgian army flees in disarray as Russians ... b"Olympic opening ceremony fireworks 'faked'" b'What were the Mossad with fraudulent New Zea... b'Russia angered by Israeli military sale to G... b'An American citizen living in S.Ossetia blam... b'Welcome To World War IV! Now In High Definit... ... b'"Do not believe TV, neither Russian nor Geor... b'Riots are still going on in Montreal (Canada... b'China to overtake US as largest manufacturer' b'War in South Ossetia [PICS]' b'Israeli Physicians Group Condemns State Tort... b' Russia has just beaten the United States ov... b'Perhaps *the* question about the Georgia - R... b'Russia is so much better at war' b"So this is what it's come to: trading sex fo... 17929.990234
2 2008-08-12 b'Remember that adorable 9-year-old who sang a... b"Russia 'ends Georgia operation'" b'"If we had no sexual harassment we would hav... b"Al-Qa'eda is losing support in Iraq because ... b'Ceasefire in Georgia: Putin Outmaneuvers the... b'Why Microsoft and Intel tried to kill the XO... b'Stratfor: The Russo-Georgian War and the Bal... b"I'm Trying to Get a Sense of This Whole Geor... b"The US military was surprised by the timing ... ... b'Why Russias response to Georgia was right' b'Gorbachev accuses U.S. of making a "serious ... b'Russia, Georgia, and NATO: Cold War Two' b'Remember that adorable 62-year-old who led y... b'War in Georgia: The Israeli connection' b'All signs point to the US encouraging Georgi... b'Christopher King argues that the US and NATO... b'America: The New Mexico?' b"BBC NEWS | Asia-Pacific | Extinction 'by man... 17694.679688
3 2008-08-13 b' U.S. refuses Israel weapons to attack Iran:... b"When the president ordered to attack Tskhinv... b' Israel clears troops who killed Reuters cam... b'Britain\'s policy of being tough on drugs is... b'Body of 14 year old found in trunk; Latest (... b'China has moved 10 *million* quake survivors... b"Bush announces Operation Get All Up In Russi... b'Russian forces sink Georgian ships ' b"The commander of a Navy air reconnaissance s... ... b'US humanitarian missions soon in Georgia - i... b"Georgia's DDOS came from US sources" b'Russian convoy heads into Georgia, violating... b'Israeli defence minister: US against strike ... b'Gorbachev: We Had No Choice' b'Witness: Russian forces head towards Tbilisi... b' Quarter of Russians blame U.S. for conflict... b'Georgian president says US military will ta... b'2006: Nobel laureate Aleksander Solzhenitsyn... 17409.720703
4 2008-08-14 b'All the experts admit that we should legalis... b'War in South Osetia - 89 pictures made by a ... b'Swedish wrestler Ara Abrahamian throws away ... b'Russia exaggerated the death toll in South O... b'Missile That Killed 9 Inside Pakistan May Ha... b"Rushdie Condemns Random House's Refusal to P... b'Poland and US agree to missle defense deal. ... b'Will the Russians conquer Tblisi? Bet on it,... b'Russia exaggerating South Ossetian death tol... ... b"Georgia confict could set back Russia's US r... b'War in the Caucasus is as much the product o... b'"Non-media" photos of South Ossetia/Georgia ... b'Georgian TV reporter shot by Russian sniper ... b'Saudi Arabia: Mother moves to block child ma... b'Taliban wages war on humanitarian aid workers' b'Russia: World "can forget about" Georgia\'s... b'Darfur rebels accuse Sudan of mounting major... b'Philippines : Peace Advocate say Muslims nee... 17140.240234
5 2008-08-15 b"Mom of missing gay man: Too bad he's not a 2... b"Russia: U.S. Poland Missile Deal Won't Go 'U... b"The government has been accused of creating ... b'The Italian government has lashed out at an ... b'Gorbachev: Georgia started conflict in S. Os... b"China fakes more than your girlfriend; 'Ethn... b"The UN's criticism of freedom of expression ... b'Russian general threatens nuclear strike on ... b'Russia can inspect Polish missile defence site' ... b'Johann Hari: We need to stop being such cowa... b'US officials have said that their military p... b'Israel clears troops who killed Reuters came... b'Unenforceable laws encourage cops to escalat... b'What Chinese pollution really looks like' b'Hacker Kidnaps and Tortures Informant, Posts... b'Bush Tells Putin: This Aggression Will Not S... b'Georgia is all about the oil pipelines' b'Rivals say they plan to remove Georgian pres... 17400.750000
6 2008-08-18 b'In an Afghan prison, the majority of female ... b"Little girl, you're not ugly; they are" b"Pakistan's Musharraf to Resign, Leave the Co... b'Tornado throws a bus in Poland, captured by ... b"Britain's terror laws have left me and my fa... b"Iran 'fires satellite into space'" b'Rights of Non-Muslims restricted by new Mald... b'Tour of Tskhinvali undercuts Russian version... b'The Great Resource War is already underway, ... ... b' New porn channel lets Canadians strut their... b'The Dangerous Neighbor: Vladimir Putin Takes... b'Israel opinion page: Russians are saner.' b"NATO's Hour" b'Georgian President Saakashvili Eats His Tie ... b'No Chicken Left Behind: Animal RFID Surveill... b'Putin has given us an order that everyone mu... b'National DNA database grows on the genes of ... b'Mayor Asks Ugly Women To Visit His Town' 18011.070312
7 2008-08-19 b"Man arrested and locked up for five hours af... b'The US missile defence system is the magic p... b'Schrder lambasted for blaming Russian confli... b'Officials: 10 French soldiers killed near Ka... b'These ten laws make China a totalitarian was... b'Russia seizes US vehicles' b"Muslims are only 4% of Denmark's 5.4 million... b'Taliban Forces Kill 10 French Soldiers and R... b'Assaults, kidnappings and killings of humani... ... b'16,000 fine for British woman caught sharing... b'102-year-old grandma is oldest person on Fac... b'Today 5 years ago - August 19th 2003. Bombin... b'US national Ken Haywood, whose computer was ... b' Taliban kill 10 French troops near Afghan c... b'Not Everybody Loves Offshore Wind Power in S... b'Taliban Forces Kill 10 French Soldiers and R... b'Pakistan is more democratic than America. ' b'Blaze engulfs Egyptian parliament' 17780.830078
8 2008-08-20 b'Two elderly Chinese women have been sentence... b'The Power of Islam: The Human Rights Council... b"We had 55 times more military soldiers in th... b'"I live here on less than a dollar a month" ... b'Russia sends aircraft carrier to Syria.' b'The American people should be eternally grat... b'Abkhazia officially appeals to Russia for in... b'Russia warns of response "beyond diplomacy" ... b'India Sets Aside 40% of Regional Wasteland f... ... b'Russia has informed Norway that it plans to ... b"'What Are the Aims of this War?': French Opp... b'Bush Covered up Musharraf Ties with Al Qaeda' b'Mikhail Gorbachev: Russia Never Wanted a War' b'Germans urge tougher laws after new privacy ... b'The Time of the Wimps: Dialogue with Russia ... b'1998 Missile Strikes on Bin Laden May Have B... b"For a moment let's forget everything else an... b'The First Solar Radio Station in Argentina' 17829.730469
9 2008-08-21 b"British resident held in Guantanamo Bay wins... b'Chinese may have killed 140 Tibetans this we... b'U.S. Navy Ships Head to Georgia' b'Hacker uncovers Chinese olympic fraud' b"If you've ever wondered what Kim Jong Il was... b"Russia's Nuclear Threat Is More Than Words" b'Czech President: "I must protest aloud again... b'50% Of All Food Produced Is Wasted Before It... b"China sentences Alive in Baghdad blogger, GR... ... b'NATOs decision to freeze relations with Mosc... b'Sweet Sixteen or Fraudulent Fourteen, Hacke... b'If Russias feeling churlish, they can pretty... b'Chinese Gymnasts 14, Official Document Shows' b'Suicide attack kills at least 50 at Pakistan... b'The Abkhazian Parliament has approved an off... b'Georgia, Bulgaria and the Second Balkan War ... b"Terrorist reveals Pak's sinister designs on ... b"International Olympic Committee launches pro... 17804.869141

10 rows × 27 columns

Step 3 - Data cleaning

NA treatment : We'll first check for missing values. The numerical columns (Date, Close) turn out to be complete; any missing headlines in the text features will be filled with a whitespace.

In [9]:
#check for NAN
df2.isnull().sum()
Out[9]:
Date     0
Top1     0
Top2     0
Top3     0
Top4     0
Top5     0
Top6     0
Top7     0
Top8     0
Top9     0
Top10    0
Top11    0
Top12    0
Top13    0
Top14    0
Top15    0
Top16    0
Top17    0
Top18    0
Top19    0
Top20    0
Top21    0
Top22    0
Top23    1
Top24    3
Top25    3
Close    0
dtype: int64

There are a few headlines missing. Let's fill them in with a whitespace.

In [10]:
#replacing nan values with a whitespace
df2 = df2.replace(np.nan, ' ', regex=True)

#sanity check
df2.isnull().sum().sum()
Out[10]:
0

Remove the byte-string artifacts : The headlines were stored as Python byte literals, so they contain prefixes such as b' and b" along with stray backslashes and quotes. These non-word tokens would only bias the sentiment analysis, so we strip them with a regex.

In [11]:
#Remove the byte-string prefixes and leftover quotes/backslashes
df2 = df2.replace('b\"|b\'|\\\\|\\\"', '', regex=True)
df2.head(2)
Out[11]:
Date Top1 Top2 Top3 Top4 Top5 Top6 Top7 Top8 Top9 ... Top17 Top18 Top19 Top20 Top21 Top22 Top23 Top24 Top25 Close
0 2008-08-08 Georgia 'downs two Russian warplanes' as count... BREAKING: Musharraf to be impeached.' Russia Today: Columns of troops roll into Sout... Russian tanks are moving towards the capital o... Afghan children raped with 'impunity,' U.N. of... 150 Russian tanks have entered South Ossetia w... Breaking: Georgia invades South Ossetia, Russi... The 'enemy combatent' trials are nothing but a... Georgian troops retreat from S. Osettain capit... ... Al-Qaeda Faces Islamist Backlash' Condoleezza Rice: The US would not act to prev... This is a busy day: The European Union has ap... Georgia will withdraw 1,000 soldiers from Iraq... Why the Pentagon Thinks Attacking Iran is a Ba... Caucasus in crisis: Georgia invades South Osse... Indian shoe manufactory - And again in a seri... Visitors Suffering from Mental Illnesses Banne... No Help for Mexico's Kidnapping Surge 17949.369141
1 2008-08-11 Why wont America and Nato help us? If they won... Bush puts foot down on Georgian conflict' Jewish Georgian minister: Thanks to Israeli tr... Georgian army flees in disarray as Russians ad... Olympic opening ceremony fireworks 'faked' What were the Mossad with fraudulent New Zeala... Russia angered by Israeli military sale to Geo... An American citizen living in S.Ossetia blames... Welcome To World War IV! Now In High Definition!' ... Do not believe TV, neither Russian nor Georgia... Riots are still going on in Montreal (Canada) ... China to overtake US as largest manufacturer' War in South Ossetia [PICS]' Israeli Physicians Group Condemns State Torture' Russia has just beaten the United States over... Perhaps *the* question about the Georgia - Rus... Russia is so much better at war' So this is what it's come to: trading sex for ... 17929.990234

2 rows × 27 columns

Sentiment and subjectivity score extraction : Now we run the sentiment analysis, extracting VADER's compound score, which ranges from -1 (most negative) to +1 (most positive). We keep the text otherwise unprocessed (no lowercasing or punctuation stripping) because VADER exploits cues such as ALL CAPS and punctuation. We'll also calculate the subjectivity of each headline using the TextBlob package.

Initialise the VADER analyzer.

In [12]:
#Sentiment and subjectivity score extraction
Anakin = SentimentIntensityAnalyzer()

Anakin.polarity_scores(" ")
Out[12]:
{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}
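
As a quick illustration (made-up headlines, reusing the analyzer initialised above): VADER treats ALL CAPS and exclamation marks as intensity cues, which is why the text is scored without the usual lowercasing and punctuation stripping.

#Made-up examples: the second headline should come out more strongly negative,
#because capitalisation and exclamation marks boost intensity in VADER.
print(Anakin.polarity_scores("A bad day for the markets")['compound'])
print(Anakin.polarity_scores("A BAD day for the markets!!!")['compound'])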

Write a function that returns the subjectivity score from TextBlob's sentiment output; it ranges from 0 (objective) to 1 (subjective). Headlines containing direct quotes tend to score as more subjective, and positive news is rarely quoted in a headline.

In [13]:
def detect_subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

detect_subjectivity(" ") #should return 0
Out[13]:
0.0
In [14]:
start_vect=time.time()
print("ANAKIN: 'Intializing the process..'")

#get the name of the headline columns
cols = []
for i in range(1,26):
    col = ("Top{}".format(i))
    cols.append(col)


for col in cols:
    df2[col] = df2[col].astype(str) # Make sure data is treated as a string
    df2[col+'_comp']= df2[col].apply(lambda x:Anakin.polarity_scores(x)['compound'])
    df2[col+'_sub'] = df2[col].apply(detect_subjectivity)
    print("{} Done".format(col))
    
print("VADER: Vaderization completed after %0.2f Minutes"%((time.time() - start_vect)/60))
ANAKIN: 'Initializing the process..'
Top1 Done
Top2 Done
Top3 Done
Top4 Done
Top5 Done
Top6 Done
Top7 Done
Top8 Done
Top9 Done
Top10 Done
Top11 Done
Top12 Done
Top13 Done
Top14 Done
Top15 Done
Top16 Done
Top17 Done
Top18 Done
Top19 Done
Top20 Done
Top21 Done
Top22 Done
Top23 Done
Top24 Done
Top25 Done
VADER: Vaderization completed after 0.54 Minutes
In [15]:
#the text isn't required anymore
df2 = df2.drop(cols,axis=1)
df2.head(5)
Out[15]:
Date Close Top1_comp Top1_sub Top2_comp Top2_sub Top3_comp Top3_sub Top4_comp Top4_sub ... Top21_comp Top21_sub Top22_comp Top22_sub Top23_comp Top23_sub Top24_comp Top24_sub Top25_comp Top25_sub
0 2008-08-08 17949.369141 -0.5994 0.0 0.0000 0.000000 -0.3612 0.000000 -0.7089 0.200000 ... -0.7579 0.666667 -0.6249 0.0 -0.2755 0.00 -0.8519 0.200000 0.1280 0.0
1 2008-08-11 17929.990234 0.8156 0.0 -0.3182 0.288889 0.4404 0.100000 -0.1965 0.000000 ... -0.8020 0.000000 0.0000 0.0 -0.3182 0.00 -0.1832 0.500000 0.0000 0.0
2 2008-08-12 17694.679688 0.0258 1.0 0.0000 0.000000 -0.7845 0.833333 -0.6124 1.000000 ... -0.5994 0.000000 0.5267 0.0 0.3818 0.35 0.0000 0.454545 0.0000 0.0
3 2008-08-13 17409.720703 -0.7184 0.0 -0.8074 0.000000 -0.6369 0.000000 -0.1280 0.444444 ... -0.2960 0.000000 0.4939 0.0 -0.5719 0.00 -0.4215 0.100000 -0.3400 0.0
4 2008-08-14 17140.240234 0.2023 0.0 -0.5994 0.000000 0.6808 0.400000 -0.8689 0.666667 ... -0.4404 0.000000 -0.5994 0.0 0.1779 0.00 -0.6908 0.500000 0.7096 0.0

5 rows × 52 columns

Summarise the compound and subjectivity scores with a weighted average across the 25 headlines (Top1 gets the most weight), plus each day's maximum and minimum.
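
A tiny toy example (made-up scores for three headlines) of the weighted average used below:

scores  = [0.5, 0.1, -0.2]           #hypothetical compound scores for Top1..Top3
weights = [3, 2, 1]                  #Top1 carries the most weight
np.average(scores, weights=weights)  #(0.5*3 + 0.1*2 - 0.2*1) / 6 = 0.25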

In [16]:
comp_cols = []
for col in cols:
    comp_col = col + "_comp"
    comp_cols.append(comp_col)

w = np.arange(1,26,1).tolist()
w.reverse()

weighted_comp = []
max_comp = []
min_comp = []
for i in range(0,len(df)):
    a = df2.loc[i,comp_cols].tolist()
    weighted_comp.append(np.average(a, weights=w))
    max_comp.append(max(a))
    min_comp.append(min(a))

df2['compound_mean'] = weighted_comp
df2['compound_max'] = max_comp
df2['compound_min'] = min_comp


sub_cols = []
for col in cols:
    sub_col = col + "_sub"
    sub_cols.append(sub_col)
    
weighted_sub = []
max_sub = []
min_sub = []
for i in range(0,len(df2)):
    a = df2.loc[i,sub_cols].tolist()
    weighted_sub.append(np.average(a, weights=w))
    max_sub.append(max(a))
    min_sub.append(min(a))

df2['subjectivity_mean'] = weighted_sub
df2['subjectivity_max'] = max_sub
df2['subjectivity_min'] = min_sub

to_drop = sub_cols+comp_cols
df2 = df2.drop(to_drop, axis=1)

df2.head(5)
Out[16]:
Date Close compound_mean compound_max compound_min subjectivity_mean subjectivity_max subjectivity_min
0 2008-08-08 17949.369141 -0.350337 0.2144 -0.9260 0.163685 0.666667 0.0
1 2008-08-11 17929.990234 -0.085277 0.8156 -0.8271 0.202921 0.720000 0.0
2 2008-08-12 17694.679688 -0.318394 0.5423 -0.8591 0.374076 1.000000 0.0
3 2008-08-13 17409.720703 -0.162032 0.5106 -0.8074 0.176371 0.900000 0.0
4 2008-08-14 17140.240234 -0.194879 0.7177 -0.8689 0.319615 1.000000 0.0

Step 4 - Exploratory Data Analysis

First, we'll explore the timeseries of the y variable (the value to be predicted). The timeseries is likely not stationary, but that doesn't worry us here because the models won't come from the classical timeseries family.

In [17]:
#EDA-time series plot
fig1 = go.Figure()
fig1.add_trace(go.Scatter(x=df2.Date, y=df.Close,
                    mode='lines'))
title = []
title.append(dict(xref='paper', yref='paper', x=0.0, y=1.05,
                              xanchor='left', yanchor='bottom',
                              text='Development of stock values from Aug, 2008 to Jun, 2016',
                              font=dict(family='Arial',
                                        size=30,
                                        color='rgb(37,37,37)'),
                              showarrow=False))
fig1.update_layout(xaxis_title='Date',
                   yaxis_title='Closing stock value (in $)',
                  annotations=title)
fig1.show()

The timeseries is clearly not stationary: there is a visible downward trend over time. Let's quantify how non-stationary it actually is using the Dickey-Fuller test.

The Dickey-Fuller test checks the null hypothesis that a unit root is present in the time series, i.e. that the data are non-stationary. If the test rejects this null hypothesis, we can treat the series as stationary.
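
For reference, a minimal sketch of running the test directly on the closing values (the plotting helper below wraps the same call):

from statsmodels.tsa.stattools import adfuller

#a p-value above 0.05 means we fail to reject the unit-root null, i.e. the series looks non-stationary
adf_stat, p_value, *_ = adfuller(df2.Close)
print("ADF statistic: {:.3f}, p-value: {:.4f}".format(adf_stat, p_value))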

In [18]:
#function for quick plotting and testing of stationarity
def stationary_plot(y, lags=None, figsize=(12, 7), style='bmh'):
    """
        Plot time series, its ACF and PACF, calculate Dickey–Fuller test
        
        y - timeseries
        lags - how many lags to include in ACF, PACF calculation
    """
    if not isinstance(y, pd.Series):
        y = pd.Series(y)
        
    with plt.style.context(style):    
        fig = plt.figure(figsize=figsize)
        layout = (2, 2)
        ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
        acf_ax = plt.subplot2grid(layout, (1, 0))
        pacf_ax = plt.subplot2grid(layout, (1, 1))
        
        y.plot(ax=ts_ax)
        p_value = sm.tsa.stattools.adfuller(y)[1]
        ts_ax.set_title('Time Series Analysis Plots\n Dickey-Fuller: p={0:.5f}'.format(p_value))
        smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
        smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
        plt.tight_layout()
In [19]:
stationary_plot(df2.Close)

Next we look at the compound sentiment scores.

In [20]:
fig2 = go.Figure()
fig2.add_trace(go.Scatter(x=df2.Date, y=df2.compound_mean,
                    mode='lines',
                    name='Mean'))
fig2.add_trace(go.Scatter(x=df2.Date, y=df2.compound_max,
                    mode='lines',
                    name='Maximum'))
fig2.add_trace(go.Scatter(x=df2.Date, y=df2.compound_min,
                    mode='lines',
                    name='Minimum'))
title = []
title.append(dict(xref='paper', yref='paper', x=0.0, y=1.05,
                              xanchor='left', yanchor='bottom',
                              text='Development of sentiment compound score',
                               font=dict(family='Arial',
                                       size=30,
                                        color='rgb(37,37,37)'),
                              showarrow=False))
fig2.update_layout(xaxis_title='Date',
                   yaxis_title='Compound score',
                  annotations=title)
fig2.show()

Let's also plot the distribution of the compound score.

In [21]:
compm_hist = px.histogram(df2, x="compound_mean")
compm_hist.show()

Now, we take a look at the subjectivity scores.

In [22]:
fig3 = go.Figure()
fig3.add_trace(go.Scatter(x=df2.Date, y=df2.subjectivity_mean,
                    mode='lines',
                    name='Mean'))
fig3.add_trace(go.Scatter(x=df2.Date, y=df2.subjectivity_min,
                    mode='lines',
                    name='Min'))
fig3.add_trace(go.Scatter(x=df2.Date, y=df2.subjectivity_max,
                    mode='lines',
                    name='Max'))
title = []
title.append(dict(xref='paper', yref='paper', x=0.0, y=1.05,
                              xanchor='left', yanchor='bottom',
                              text='Development of subjectivity score',
                              font=dict(family='Arial',
                                        size=30,
                                        color='rgb(37,37,37)'),
                              showarrow=False))
fig3.update_layout(xaxis_title='Date',
                   yaxis_title='Subjectivity score',
                  annotations=title)
fig3.show()

Now we plot distribution of the subjectivity scores as well.

In [23]:
subm_hist = px.histogram(df2, x="subjectivity_mean")
subm_hist.show()

Next we'll look at some descriptive statistics about the data.

In [24]:
df2.describe()
Out[24]:
Close compound_mean compound_max compound_min subjectivity_mean subjectivity_max subjectivity_min
count 1989.000000 1989.000000 1989.000000 1989.000000 1989.000000 1989.000000 1989.0
mean 13463.032255 -0.213362 0.656276 -0.885679 0.251779 0.892645 0.0
std 3144.006996 0.106789 0.165663 0.062831 0.066196 0.144796 0.0
min 6547.049805 -0.546383 0.000000 -0.989800 0.080897 0.416667 0.0
25% 10913.379883 -0.283481 0.542300 -0.931600 0.205581 0.785714 0.0
50% 13025.580078 -0.210605 0.670500 -0.897900 0.249927 1.000000 0.0
75% 16478.410156 -0.141434 0.784500 -0.851900 0.297009 1.000000 0.0
max 18312.390625 0.166370 0.962300 -0.571900 0.470498 1.000000 0.0

Feature selection

We are not going to use many feature-selection methods since the features were mostly handcrafted. We'll simply look at each feature's variance and its proportion of unique values.
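
As a hedged aside, the VarianceThreshold transformer imported in Step 1 (and otherwise unused) could automate the low-variance part of this check; the threshold used here is an arbitrary example value:

#Illustration only: drop features whose variance falls below an example threshold
selector = VarianceThreshold(threshold=1e-3)
selector.fit(df2[['compound_mean', 'compound_max', 'compound_min',
                  'subjectivity_mean', 'subjectivity_max', 'subjectivity_min']])
selector.get_support()  #boolean mask, False marks features below the variance threshold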

In [25]:
def unique_ratio (col):
    return len(np.unique(col))/len(col)

cols = ['Close', 'compound_mean', 'compound_max', 'compound_min', 'subjectivity_mean', 'subjectivity_max', 'subjectivity_min']

ur = []
var = []
for col in cols:
    ur.append(unique_ratio(df2[col]))
    var.append(np.var(df2[col]))
    
feature_sel = pd.DataFrame({'Column': cols, 
              'Unique': ur,
              'Variance': var})
feature_sel
Out[25]:
Column Unique Variance
0 Close 0.994470 9.879810e+06
1 compound_mean 1.000000 1.139823e-02
2 compound_max 0.188034 2.743036e-02
3 compound_min 0.176471 3.945746e-03
4 subjectivity_mean 0.999497 4.379750e-03
5 subjectivity_max 0.083962 2.095546e-02
6 subjectivity_min 0.000503 0.000000e+00
In [26]:
sel_fig = go.Figure(data=go.Scatter(
    x=feature_sel.Column,
    y=feature_sel.Unique,
    mode='markers',
    marker=dict(size=(feature_sel.Unique*100)),
))
sel_fig.update_layout(title='Ratio of unique values', 
                      yaxis_title='Unique ratio')
sel_fig.show()

Compound maximum and minimum are potentially less interesting, as only ~18% of their values are unique. The subjectivity maximum has an even lower unique ratio, and the subjectivity minimum is almost always 0. We'll drop subjectivity_min and subjectivity_max.

In [27]:
drop = ['subjectivity_min', 'subjectivity_max']
clean_df = df2.drop(drop,axis=1)

Step 5 - Lag the extracted features

To let the models look into the past, we'll add features that are simply copies of earlier rows, shifted n steps back. To avoid creating too many new features, we'll add only two days of lags for the sentiment features and four days of lags for the closing value.
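
As a minimal illustration on a toy series, this is what .shift() produces (the lagged columns built below work the same way):

s = pd.Series([10, 20, 30, 40])
s.shift(1)  #-> NaN, 10, 20, 30 : each row now holds the previous row's value
s.shift(2)  #-> NaN, NaN, 10, 20 : the first n rows of an n-step lag are NaN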

In [28]:
lag_df = clean_df.copy()
lag_df.head(3)
Out[28]:
Date Close compound_mean compound_max compound_min subjectivity_mean
0 2008-08-08 17949.369141 -0.350337 0.2144 -0.9260 0.163685
1 2008-08-11 17929.990234 -0.085277 0.8156 -0.8271 0.202921
2 2008-08-12 17694.679688 -0.318394 0.5423 -0.8591 0.374076
In [29]:
to_lag = list(lag_df.columns)
to_lag_4 = to_lag[1]               # 'Close' - lagged four days back below
to_lag_1 = to_lag[2:len(to_lag)]   # sentiment features - lagged two days back below
In [30]:
#lagging text features two days back
for col in to_lag_1:
    for i in range(1,3):
        new_name = col + ('_lag_{}'.format(i))
        lag_df[new_name] = lag_df[col].shift(i)
    
#lagging closing values 4 days back
for i in range(1, 5):
    new_name = to_lag_4 + ('_lag_{}'.format(i))
    lag_df[new_name] = lag_df[to_lag_4].shift(i)

This process created rows with NAs at the start of the dataset. Unfortunately these rows have to be removed, since we simply don't have data from before the start of the series.

In [31]:
#Show how many rows need to be removed
lag_df.head(10) 
Out[31]:
Date Close compound_mean compound_max compound_min subjectivity_mean compound_mean_lag_1 compound_mean_lag_2 compound_max_lag_1 compound_max_lag_2 compound_min_lag_1 compound_min_lag_2 subjectivity_mean_lag_1 subjectivity_mean_lag_2 Close_lag_1 Close_lag_2 Close_lag_3 Close_lag_4
0 2008-08-08 17949.369141 -0.350337 0.2144 -0.9260 0.163685 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2008-08-11 17929.990234 -0.085277 0.8156 -0.8271 0.202921 -0.350337 NaN 0.2144 NaN -0.9260 NaN 0.163685 NaN 17949.369141 NaN NaN NaN
2 2008-08-12 17694.679688 -0.318394 0.5423 -0.8591 0.374076 -0.085277 -0.350337 0.8156 0.2144 -0.8271 -0.9260 0.202921 0.163685 17929.990234 17949.369141 NaN NaN
3 2008-08-13 17409.720703 -0.162032 0.5106 -0.8074 0.176371 -0.318394 -0.085277 0.5423 0.8156 -0.8591 -0.8271 0.374076 0.202921 17694.679688 17929.990234 17949.369141 NaN
4 2008-08-14 17140.240234 -0.194879 0.7177 -0.8689 0.319615 -0.162032 -0.318394 0.5106 0.5423 -0.8074 -0.8591 0.176371 0.374076 17409.720703 17694.679688 17929.990234 17949.369141
5 2008-08-15 17400.750000 -0.143104 0.4404 -0.7481 0.227282 -0.194879 -0.162032 0.7177 0.5106 -0.8689 -0.8074 0.319615 0.176371 17140.240234 17409.720703 17694.679688 17929.990234
6 2008-08-18 18011.070312 -0.263546 0.5106 -0.9260 0.216935 -0.143104 -0.194879 0.4404 0.7177 -0.7481 -0.8689 0.227282 0.319615 17400.750000 17140.240234 17409.720703 17694.679688
7 2008-08-19 17780.830078 -0.373172 0.5574 -0.8720 0.256786 -0.263546 -0.143104 0.5106 0.4404 -0.9260 -0.7481 0.216935 0.227282 18011.070312 17400.750000 17140.240234 17409.720703
8 2008-08-20 17829.730469 -0.197157 0.4847 -0.8807 0.095403 -0.373172 -0.263546 0.5574 0.5106 -0.8720 -0.9260 0.256786 0.216935 17780.830078 18011.070312 17400.750000 17140.240234
9 2008-08-21 17804.869141 -0.268522 0.5719 -0.9022 0.107994 -0.197157 -0.373172 0.4847 0.5574 -0.8807 -0.8720 0.095403 0.256786 17829.730469 17780.830078 18011.070312 17400.750000

Above we can see that the first 4 rows now have missing values. Let's delete them and reset the index.

In [32]:
lag_df = lag_df.drop(lag_df.index[0:4])
lag_df = lag_df.reset_index(drop=True)

#sanity check for NaNs
lag_df.isnull().sum().sum()
Out[32]:
0
In [33]:
lag_df.head(5)
Out[33]:
Date Close compound_mean compound_max compound_min subjectivity_mean compound_mean_lag_1 compound_mean_lag_2 compound_max_lag_1 compound_max_lag_2 compound_min_lag_1 compound_min_lag_2 subjectivity_mean_lag_1 subjectivity_mean_lag_2 Close_lag_1 Close_lag_2 Close_lag_3 Close_lag_4
0 2008-08-14 17140.240234 -0.194879 0.7177 -0.8689 0.319615 -0.162032 -0.318394 0.5106 0.5423 -0.8074 -0.8591 0.176371 0.374076 17409.720703 17694.679688 17929.990234 17949.369141
1 2008-08-15 17400.750000 -0.143104 0.4404 -0.7481 0.227282 -0.194879 -0.162032 0.7177 0.5106 -0.8689 -0.8074 0.319615 0.176371 17140.240234 17409.720703 17694.679688 17929.990234
2 2008-08-18 18011.070312 -0.263546 0.5106 -0.9260 0.216935 -0.143104 -0.194879 0.4404 0.7177 -0.7481 -0.8689 0.227282 0.319615 17400.750000 17140.240234 17409.720703 17694.679688
3 2008-08-19 17780.830078 -0.373172 0.5574 -0.8720 0.256786 -0.263546 -0.143104 0.5106 0.4404 -0.9260 -0.7481 0.216935 0.227282 18011.070312 17400.750000 17140.240234 17409.720703
4 2008-08-20 17829.730469 -0.197157 0.4847 -0.8807 0.095403 -0.373172 -0.263546 0.5574 0.5106 -0.8720 -0.9260 0.256786 0.216935 17780.830078 18011.070312 17400.750000 17140.240234

Step 8 - Model training

Let's train 3 ML models, in 2 rounds. First, using only the econometric features (the 4 lags of y). Second, using only the information extracted from the headlines (compound and subjectivity scores and their lags).

Models:

1)Ridge regression - shrinks the coefficients to penalize overly complex models, but doesn't force any of them all the way to zero

2)Random forest

3)XGBoost

We'll score all models by mean squared error, as it penalizes larger mistakes more heavily. Before each model is trained, we standardize the training data.

The first step is creating the folds for cross-validation. We'll use the same folds for all models, so that a meta-model could later be built on top of them. Since we're working with a timeseries, the folds cannot be selected at random; instead, each fold is a contiguous block of data, so the time ordering is preserved.
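
A toy example (made-up indices) of how TimeSeriesSplit arranges the folds: every training window ends before its validation window starts, and the training window expands fold by fold.

for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(np.arange(8)):
    print(train_idx, val_idx)
#[0 1] [2 3]
#[0 1 2 3] [4 5]
#[0 1 2 3 4 5] [6 7]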

In [34]:
# for time-series cross-validation set 10 folds 
tscv = TimeSeriesSplit(n_splits=10)

The cost function to minimize is mean squared error, because it penalizes larger errors disproportionately. The mean absolute percentage error will be used for plotting and easier interpretation, since errors expressed as percentages are much easier to read. Each training fold is scaled (standardized) independently inside the pipeline to minimize data leakage.

In [35]:
def mape(y_true, y_pred): 
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
scorer = make_scorer(mean_squared_error)
scaler = StandardScaler()   
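
One caveat worth flagging (based on scikit-learn's documented defaults, not on anything stated in this notebook): make_scorer treats higher scores as better unless told otherwise, so a plain make_scorer(mean_squared_error) makes GridSearchCV favour larger errors. A sketch of the usual fix:

#Hedged alternative: tell scikit-learn explicitly that lower MSE is better
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
#equivalently, pass scoring='neg_mean_squared_error' straight to GridSearchCV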

Next we split the dataset into training and testing. 20% of the data will be used for testing.

In [36]:
def ts_train_test_split(X, y, test_size):
    """
        Perform train-test split with respect to time series structure
    """
    
    # get the index after which test set starts
    test_index = int(len(X)*(1-test_size))
    
    X_train = X.iloc[:test_index]
    y_train = y.iloc[:test_index]
    X_test = X.iloc[test_index:]
    y_test = y.iloc[test_index:]
    
    return X_train, X_test, y_train, y_test
In [37]:
X = lag_df.drop(['Close'],axis=1)
X.index = X["Date"]
X = X.drop(['Date'],axis=1)
y = lag_df.Close

X_train, X_test, y_train, y_test = ts_train_test_split(X, y, test_size = 0.2)

#sanity check
(len(X_train)+len(X_test))==len(X)
Out[37]:
True
In [38]:
#function for plotting coefficients of linear models (used below for ridge)
def plotCoef(model,train_x):
    """
        Plots sorted coefficient values of the model
    """
    
    coefs = pd.DataFrame(model.coef_, train_x.columns)
    coefs.columns = ["coef"]
    coefs["abs"] = coefs.coef.apply(np.abs)
    coefs = coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1)
    
    plt.figure(figsize=(15, 7))
    coefs.coef.plot(kind='bar')
    plt.grid(True, axis='y')
    plt.hlines(y=0, xmin=0, xmax=len(coefs), linestyles='dashed');

Step 8.1 - Econometric models

First let's train models using only the lags of the y variable (i.e. Close).

In [39]:
econ_cols = list(X_train.columns)
econ_cols = econ_cols[12:17]  # positions 12-15 hold the four Close_lag_* columns
X_train_e = X_train[econ_cols]
X_test_e = X_test[econ_cols]
y_train_e = y_train
y_test_e = y_test
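
Positional slicing is fragile if the column order ever changes; a hypothetical, more robust alternative is to pick the lag columns by name:

#Hypothetical alternative: select the lagged closing-value columns by name
econ_cols = [c for c in X_train.columns if c.startswith('Close_lag')]
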
In [40]:
econ_perf = pd.DataFrame(columns=['Model','MSE', 'SD'])
econ_perf
Out[40]:
Model MSE SD

Ridge regression

In [41]:
ridge_param = {'model__alpha': list(np.arange(0.001,1,0.001))}
ridge = Ridge(max_iter=5000)
pipe = Pipeline([
    ('scale', scaler),
    ('model', ridge)])
search_ridge = GridSearchCV(estimator=pipe,
                          param_grid = ridge_param,
                          scoring = scorer,
                          cv = tscv,
                          n_jobs=4
                         )
search_ridge.fit(X_train_e, y_train_e)
Out[41]:
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
             estimator=Pipeline(steps=[('scale', StandardScaler()),
                                       ('model', Ridge(max_iter=5000))]),
             n_jobs=4,
             param_grid={'model__alpha': [0.001, 0.002, 0.003, 0.004, 0.005,
                                          0.006, 0.007, 0.008,
                                          0.009000000000000001,
                                          0.010000000000000002, 0.011, 0.012,
                                          0.013000000000000001,
                                          0.014000000000000002, 0.015, 0.016,
                                          0.017, 0.018000000000000002,
                                          0.019000000000000003, 0.02, 0.021,
                                          0.022000000000000002, 0.023, 0.024,
                                          0.025, 0.026000000000000002,
                                          0.027000000000000003, 0.028, 0.029,
                                          0.030000000000000002, ...]},
             scoring=make_scorer(mean_squared_error))
In [42]:
ridge_e = search_ridge.best_estimator_

#get cv results of the best model + confidence intervals
from sklearn.model_selection import cross_val_score
cv_score = cross_val_score(ridge_e, X_train_e, y_train_e, cv=tscv, scoring=scorer)
econ_perf = econ_perf.append({'Model':'Ridge', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
ridge_e
Out[42]:
Pipeline(steps=[('scale', StandardScaler()),
                ('model', Ridge(alpha=0.999, max_iter=5000))])
In [43]:
plotCoef(ridge_e['model'], X_train_e)
In [44]:
coefs = ridge_e['model'].coef_
ridge_coefs = pd.DataFrame({'Coef': coefs,
                           'Name': list(X_train_e.columns)})
ridge_coefs["abs"] = ridge_coefs.Coef.apply(np.abs)
ridge_coefs = ridge_coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1)
ridge_coefs
Out[44]:
Coef Name
0 2041.601673 Close_lag_1
1 414.480793 Close_lag_2
3 58.219956 Close_lag_4
2 30.646149 Close_lag_3
In [45]:
econ_perf
Out[45]:
Model MSE SD
0 Ridge 16148.54356 8475.253033

Random forest

In [46]:
rf_param = {'model__n_estimators': [10, 100, 300],
            'model__max_depth': [10, 20, 30, 40],
            'model__min_samples_split': [2, 5, 10],
            'model__min_samples_leaf': [1, 2, 3],
            'model__max_features': ["auto", 'sqrt']}
In [47]:
rf = RandomForestRegressor()
pipe = Pipeline([
    ('scale', scaler),
    ('model', rf)])
gridsearch_rf = GridSearchCV(estimator=pipe,
                          param_grid = rf_param,
                          scoring = scorer,
                          cv = tscv,
                          n_jobs=4,
                          verbose=3
                         )
In [48]:
gridsearch_rf.fit(X_train_e, y_train_e)
Fitting 10 folds for each of 216 candidates, totalling 2160 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    3.7s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:   24.7s
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:   51.2s
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  1.4min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:  2.4min
[Parallel(n_jobs=4)]: Done 1144 tasks      | elapsed:  3.4min
[Parallel(n_jobs=4)]: Done 1560 tasks      | elapsed:  4.6min
[Parallel(n_jobs=4)]: Done 2040 tasks      | elapsed:  5.9min
[Parallel(n_jobs=4)]: Done 2160 out of 2160 | elapsed:  6.2min finished
Out[48]:
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
             estimator=Pipeline(steps=[('scale', StandardScaler()),
                                       ('model', RandomForestRegressor())]),
             n_jobs=4,
             param_grid={'model__max_depth': [10, 20, 30, 40],
                         'model__max_features': ['auto', 'sqrt'],
                         'model__min_samples_leaf': [1, 2, 3],
                         'model__min_samples_split': [2, 5, 10],
                         'model__n_estimators': [10, 100, 300]},
             scoring=make_scorer(mean_squared_error), verbose=3)
In [49]:
rf_e = gridsearch_rf.best_estimator_

#get cv results of the best model + confidence intervals
cv_score = cross_val_score(rf_e, X_train_e, y_train_e, cv=tscv, scoring=scorer)
econ_perf = econ_perf.append({'Model':'RF', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
In [50]:
econ_perf
Out[50]:
Model MSE SD
0 Ridge 16148.543560 8475.253033
1 RF 175910.783385 162425.696179

XGBoost

Using the linear booster (gblinear) because tree-based boosters cannot extrapolate beyond the range of values seen in training, which is a problem for a trending timeseries.

In [52]:
xgb_param = {'model__lambda': list(np.arange(0.1,3, 0.1)), #L2 regularisation
             'model__alpha': list(np.arange(0.1,3, 0.1)),  #L1 regularisation
            }
In [53]:
xgb = XGBRegressor(booster='gblinear', feature_selector='shuffle', objective='reg:squarederror')

pipe = Pipeline([
    ('scale', scaler),
    ('model', xgb)])
gridsearch_xgb = GridSearchCV(estimator=pipe,
                          param_grid = xgb_param,
                          scoring = scorer,
                          cv = tscv,
                          n_jobs=4,
                          verbose=3
                         )
In [54]:
gridsearch_xgb.fit(X_train_e, y_train_e)
Fitting 10 folds for each of 841 candidates, totalling 8410 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    1.7s
[Parallel(n_jobs=4)]: Done 380 tasks      | elapsed:    5.3s
[Parallel(n_jobs=4)]: Done 1020 tasks      | elapsed:   11.3s
[Parallel(n_jobs=4)]: Done 1916 tasks      | elapsed:   19.8s
[Parallel(n_jobs=4)]: Done 3068 tasks      | elapsed:   30.7s
[Parallel(n_jobs=4)]: Done 4476 tasks      | elapsed:   44.0s
[Parallel(n_jobs=4)]: Done 6140 tasks      | elapsed:   59.7s
[Parallel(n_jobs=4)]: Done 8060 tasks      | elapsed:  1.3min
[Parallel(n_jobs=4)]: Done 8410 out of 8410 | elapsed:  1.4min finished
Out[54]:
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
             estimator=Pipeline(steps=[('scale', StandardScaler()),
                                       ('model',
                                        XGBRegressor(base_score=None,
                                                     booster='gblinear',
                                                     colsample_bylevel=None,
                                                     colsample_bynode=None,
                                                     colsample_bytree=None,
                                                     feature_selector='shuffle',
                                                     gamma=None, gpu_id=None,
                                                     importance_type='gain',
                                                     interaction_constraints=None,
                                                     learni...
                         'model__lambda': [0.1, 0.2, 0.30000000000000004, 0.4,
                                           0.5, 0.6, 0.7000000000000001, 0.8,
                                           0.9, 1.0, 1.1, 1.2000000000000002,
                                           1.3000000000000003,
                                           1.4000000000000001,
                                           1.5000000000000002, 1.6,
                                           1.7000000000000002,
                                           1.8000000000000003,
                                           1.9000000000000001, 2.0, 2.1, 2.2,
                                           2.3000000000000003,
                                           2.4000000000000004,
                                           2.5000000000000004, 2.6, 2.7,
                                           2.8000000000000003,
                                           2.9000000000000004]},
             scoring=make_scorer(mean_squared_error), verbose=3)
In [55]:
xgb_e = gridsearch_xgb.best_estimator_

#get cv results of the best model + confidence intervals
cv_score = cross_val_score(xgb_e, X_train_e, y_train_e, cv=tscv, scoring=scorer)
econ_perf = econ_perf.append({'Model':'XGB', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
xgb_e
Out[55]:
Pipeline(steps=[('scale', StandardScaler()),
                ('model',
                 XGBRegressor(alpha=2.9000000000000004, base_score=0.5,
                              booster='gblinear', colsample_bylevel=None,
                              colsample_bynode=None, colsample_bytree=None,
                              feature_selector='shuffle', gamma=None, gpu_id=-1,
                              importance_type='gain',
                              interaction_constraints=None,
                              lambda=2.9000000000000004, learning_rate=0.5,
                              max_delta_step=None, max_depth=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, n_estimators=100,
                              n_jobs=0, num_parallel_tree=None, random_state=0,
                              reg_alpha=0, reg_lambda=0, scale_pos_weight=1,
                              subsample=None, tree_method=None,
                              validate_parameters=1, verbosity=None))])
In [56]:
print(econ_perf)
   Model           MSE            SD
0  Ridge  1.614854e+04  8.475253e+03
1     RF  1.759108e+05  1.624257e+05
2    XGB  1.249194e+06  1.982986e+06
In [57]:
econ_fig = px.scatter(econ_perf, x="Model", y='MSE', color='Model', error_y="SD")
econ_fig.update_layout(title_text="Performance of models trained on lags of y")
econ_fig.show()

Step 8.2 - NLP models

Let's now try to predict the stock value using only the information extracted from the news headlines.

In [58]:
X_train_n = X_train.drop(econ_cols, axis=1)
X_test_n = X_test.drop(econ_cols, axis=1)
y_train_n = y_train
y_test_n = y_test
In [59]:
nlp_perf = pd.DataFrame(columns=['Model','MSE', 'SD'])
nlp_perf
Out[59]:
Model MSE SD

Ridge regression

In [60]:
ridge_param = {'model__alpha': list(np.arange(1,10,0.1))}
ridge = Ridge(max_iter=5000)
pipe = Pipeline([
    ('scale', scaler),
    ('model', ridge)
])
search_ridge = GridSearchCV(estimator=pipe,
                          param_grid = ridge_param,
                          scoring = scorer,
                          cv = tscv,
                          n_jobs=4
                         )
search_ridge.fit(X_train_n, y_train_n)
Out[60]:
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
             estimator=Pipeline(steps=[('scale', StandardScaler()),
                                       ('model', Ridge(max_iter=5000))]),
             n_jobs=4,
             param_grid={'model__alpha': [1.0, 1.1, 1.2000000000000002,
                                          1.3000000000000003,
                                          1.4000000000000004,
                                          1.5000000000000004,
                                          1.6000000000000005,
                                          1.7000000000000006,
                                          1.8000000000000007,
                                          1.9000000000000008, 2....
                                          2.300000000000001, 2.4000000000000012,
                                          2.5000000000000013,
                                          2.6000000000000014,
                                          2.7000000000000015,
                                          2.8000000000000016,
                                          2.9000000000000017,
                                          3.0000000000000018, 3.100000000000002,
                                          3.200000000000002, 3.300000000000002,
                                          3.400000000000002, 3.500000000000002,
                                          3.6000000000000023,
                                          3.7000000000000024,
                                          3.8000000000000025,
                                          3.9000000000000026, ...]},
             scoring=make_scorer(mean_squared_error))
In [61]:
ridge_n = search_ridge.best_estimator_

#get cv results of the best model + confidence intervals
cv_score = cross_val_score(ridge_n, X_train_n, y_train_n, cv=tscv, scoring=scorer)
nlp_perf = nlp_perf.append({'Model':'Ridge', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
ridge_n
Out[61]:
Pipeline(steps=[('scale', StandardScaler()), ('model', Ridge(max_iter=5000))])
In [62]:
plotCoef(ridge_n['model'], X_train_n)

coefs = ridge_n['model'].coef_
ridge_coefs = pd.DataFrame({'Coef': coefs,
                           'Name': list(X_train_n.columns)})
ridge_coefs["abs"] = ridge_coefs.Coef.apply(np.abs)
ridge_coefs = ridge_coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1)
ridge_coefs
Out[62]:
Coef Name
9 176.487936 compound_min_lag_2
8 174.531051 compound_min_lag_1
4 -162.089893 compound_mean_lag_1
2 155.208150 compound_min
0 -148.888611 compound_mean
5 -142.849053 compound_mean_lag_2
11 124.592908 subjectivity_mean_lag_2
3 120.468165 subjectivity_mean
10 119.085025 subjectivity_mean_lag_1
1 -37.363883 compound_max
6 -33.016224 compound_max_lag_1
7 -27.976245 compound_max_lag_2
In [63]:
mape(y_test, ridge_n.predict(X_test_n))
Out[63]:
58.838053988800944

Predicted values for years 2014-2016 using ridge regression.

In [64]:
print(ridge_n.predict(X_test_n))
[14716.64219864 14705.08819751 13980.10059014 13791.12076183
 14676.24671027 14686.22125486 14593.85122921 14000.79046681
 14502.44731183 14375.95034239 14688.04942339 14353.51442502
 14103.67449935 14116.34933655 14419.39078007 14596.81437977
 14614.03005235 14325.79387597 14481.54474328 14572.80654748
 14531.77731084 14357.0924025  13967.2853402  14151.41176422
 14673.88472455 14670.03226724 14688.91824998 14435.91990249
 14351.33827278 14128.78411089 14234.42232088 14861.38549773
 15279.58535055 15059.59589981 14443.75073891 13956.83386789
 13833.93647897 14014.45214329 14177.00415029 14720.5853178
 14915.55849948 15568.88895265 15174.39748957 14926.33275679
 14264.52157732 14140.31270066 14304.3287834  14283.84574707
 14575.90891318 14359.32563443 14806.19153749 14576.43917536
 14838.82908988 14763.04198214 14725.27659296 14738.37829882
 14851.09603646 14905.67521348 14676.88580713 14417.72132961
 14161.87091563 14535.78451018 14705.28795344 14520.6469565
 ...
 14296.67260308 14222.26694147 13812.45226506 13808.49358293]

Random Forest

In [65]:
rf_param = {'model__n_estimators': [10, 100, 300],
            'model__max_depth': [10, 20, 30, 40],
            'model__min_samples_split': [2, 5, 10],
            'model__min_samples_leaf': [1, 2, 3],
            'model__max_features': ["auto", 'sqrt']}
rf = RandomForestRegressor()
pipe = Pipeline([
    ('scale', scaler),
    ('model', rf)])
gridsearch_rf = GridSearchCV(estimator=pipe,
                          param_grid = rf_param,
                          scoring = scorer,
                          cv = tscv,
                          n_jobs=4,
                          verbose=3
                         )
gridsearch_rf.fit(X_train_n, y_train_n)
Fitting 10 folds for each of 216 candidates, totalling 2160 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  38 tasks      | elapsed:   10.0s
[Parallel(n_jobs=4)]: Done 136 tasks      | elapsed:   49.4s
[Parallel(n_jobs=4)]: Done 296 tasks      | elapsed:  1.7min
[Parallel(n_jobs=4)]: Done 520 tasks      | elapsed:  2.2min
[Parallel(n_jobs=4)]: Done 808 tasks      | elapsed:  3.8min
[Parallel(n_jobs=4)]: Done 1160 tasks      | elapsed:  5.3min
[Parallel(n_jobs=4)]: Done 1576 tasks      | elapsed:  7.0min
[Parallel(n_jobs=4)]: Done 2056 tasks      | elapsed:  9.2min
[Parallel(n_jobs=4)]: Done 2160 out of 2160 | elapsed:  9.5min finished
Out[65]:
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
             estimator=Pipeline(steps=[('scale', StandardScaler()),
                                       ('model', RandomForestRegressor())]),
             n_jobs=4,
             param_grid={'model__max_depth': [10, 20, 30, 40],
                         'model__max_features': ['auto', 'sqrt'],
                         'model__min_samples_leaf': [1, 2, 3],
                         'model__min_samples_split': [2, 5, 10],
                         'model__n_estimators': [10, 100, 300]},
             scoring=make_scorer(mean_squared_error), verbose=3)
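One caveat: the repr above shows scoring=make_scorer(mean_squared_error), and make_scorer defaults to greater_is_better=True, so the grid search ranks candidates by the highest MSE rather than the lowest. If MSE is meant to be minimized, the scorer (defined earlier in the notebook) would need to flip that flag, e.g.:

#a scorer where lower MSE is better; GridSearchCV then works with negative MSE internally
scorer = make_scorer(mean_squared_error, greater_is_better=False)
#or simply pass the built-in string: GridSearchCV(..., scoring='neg_mean_squared_error')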
In [66]:
rf_n = gridsearch_rf.best_estimator_

#get cv results of the best model + confidence intervals
cv_score = cross_val_score(rf_n, X_train_n, y_train_n, cv=tscv, scoring=scorer)
nlp_perf = nlp_perf.append({'Model':'RF', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
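Note that DataFrame.append was removed in pandas 2.0; when re-running the notebook on a newer pandas, the equivalent pd.concat pattern would be:

#pandas >= 2.0: DataFrame.append is gone, concatenate a one-row frame instead
nlp_perf = pd.concat([nlp_perf,
                      pd.DataFrame([{'Model': 'RF',
                                     'MSE': np.mean(cv_score),
                                     'SD': np.std(cv_score)}])],
                     ignore_index=True)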

XGBoost

In [67]:
xgb_param = {'model__lambda': list(np.arange(1,10, 1)), #L2 regularisation
             'model__alpha': list(np.arange(1,10, 1)),  #L1 regularisation
            }
xgb = XGBRegressor(booster='gblinear', feature_selector='shuffle', objective='reg:squarederror')

pipe = Pipeline([
    ('scale', scaler),
    ('model', xgb)])
gridsearch_xgb = GridSearchCV(estimator=pipe,
                          param_grid = xgb_param,
                          scoring = scorer,
                          cv = tscv,
                          n_jobs=4,
                          verbose=3
                         )
gridsearch_xgb.fit(X_train_n, y_train_n)
Fitting 10 folds for each of 81 candidates, totalling 810 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  56 tasks      | elapsed:    0.7s
[Parallel(n_jobs=4)]: Done 440 tasks      | elapsed:    6.6s
[Parallel(n_jobs=4)]: Done 810 out of 810 | elapsed:   11.1s finished
Out[67]:
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
             estimator=Pipeline(steps=[('scale', StandardScaler()),
                                       ('model',
                                        XGBRegressor(base_score=None,
                                                     booster='gblinear',
                                                     colsample_bylevel=None,
                                                     colsample_bynode=None,
                                                     colsample_bytree=None,
                                                     feature_selector='shuffle',
                                                     gamma=None, gpu_id=None,
                                                     importance_type='gain',
                                                     interaction_constraints=None,
                                                     learni...
                                                     monotone_constraints=None,
                                                     n_estimators=100,
                                                     n_jobs=None,
                                                     num_parallel_tree=None,
                                                     random_state=None,
                                                     reg_alpha=None,
                                                     reg_lambda=None,
                                                     scale_pos_weight=None,
                                                     subsample=None,
                                                     tree_method=None,
                                                     validate_parameters=None,
                                                     verbosity=None))]),
             n_jobs=4,
             param_grid={'model__alpha': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                         'model__lambda': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
             scoring=make_scorer(mean_squared_error), verbose=3)
In [68]:
xgb_n = gridsearch_xgb.best_estimator_

#get cv results of the best model + confidence intervals
cv_score = cross_val_score(xgb_n, X_train_n, y_train_n, cv=tscv, scoring=scorer)
nlp_perf = nlp_perf.append({'Model':'XGB', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
xgb_n
Out[68]:
Pipeline(steps=[('scale', StandardScaler()),
                ('model',
                 XGBRegressor(alpha=9, base_score=0.5, booster='gblinear',
                              colsample_bylevel=None, colsample_bynode=None,
                              colsample_bytree=None, feature_selector='shuffle',
                              gamma=None, gpu_id=-1, importance_type='gain',
                              interaction_constraints=None, lambda=1,
                              learning_rate=0.5, max_delta_step=None,
                              max_depth=None, min_child_weight=None,
                              missing=nan, monotone_constraints=None,
                              n_estimators=100, n_jobs=0,
                              num_parallel_tree=None, random_state=0,
                              reg_alpha=0, reg_lambda=0, scale_pos_weight=1,
                              subsample=None, tree_method=None,
                              validate_parameters=1, verbosity=None))])
In [69]:
print(nlp_perf)
   Model           MSE            SD
0  Ridge  8.062476e+06  6.728594e+06
1     RF  8.257951e+06  7.146497e+06
2    XGB  8.062751e+06  6.728328e+06
In [70]:
nlp_fig = px.scatter(nlp_perf, x="Model", y='MSE', color='Model', error_y="SD")
#nlp_fig.update_layout(title_text="Performance of models trained on NLP features")
nlp_fig.show()

The three models perform quite similarly. They might be useful candidates for stacking.

Econ+NLP models

Let's now use all of the features (econometric and NLP) together.
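X_train and X_test here hold both feature sets. They are assembled earlier in the notebook; as a rough sketch of the idea, assuming the econometric frame (X_train_e) and the NLP frame (X_train_n) share the same index, the combination is just a column-wise concatenation:

#sketch only - combine lagged-price (econ) and sentiment (NLP) features into one design matrix
X_train = pd.concat([X_train_e, X_train_n], axis=1)
X_test = pd.concat([X_test_e, X_test_n], axis=1)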

Ridge Regression

In [71]:
en_perf = pd.DataFrame(columns=['Model','MSE', 'SD'])
en_perf
Out[71]:
Model MSE SD
In [72]:
ridge_param = {'model__alpha': list(np.arange(0.1,1,0.01))}
ridge = Ridge(max_iter=5000)
pipe = Pipeline([
    ('scale', scaler),
    ('model', ridge)])
search_ridge = GridSearchCV(estimator=pipe,
                          param_grid = ridge_param,
                          scoring = scorer,
                          cv = tscv,
                          n_jobs=4,
                          verbose=3
                         )
search_ridge.fit(X_train, y_train)
Fitting 10 folds for each of 90 candidates, totalling 900 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  56 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done 824 tasks      | elapsed:    4.7s
[Parallel(n_jobs=4)]: Done 900 out of 900 | elapsed:    5.1s finished
Out[72]:
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
             estimator=Pipeline(steps=[('scale', StandardScaler()),
                                       ('model', Ridge(max_iter=5000))]),
             n_jobs=4,
             param_grid={'model__alpha': [0.1, 0.11, 0.12, 0.13,
                                          0.13999999999999999,
                                          0.14999999999999997,
                                          0.15999999999999998,
                                          0.16999999999999998,
                                          0.17999999999999997,
                                          0.18999999999999995,
                                          0.19999999999999996,
                                          0.2...
                                          0.23999999999999994,
                                          0.24999999999999992,
                                          0.2599999999999999,
                                          0.2699999999999999,
                                          0.2799999999999999,
                                          0.2899999999999999,
                                          0.29999999999999993,
                                          0.30999999999999994,
                                          0.3199999999999999,
                                          0.32999999999999985,
                                          0.33999999999999986,
                                          0.34999999999999987,
                                          0.3599999999999999,
                                          0.3699999999999999,
                                          0.3799999999999999,
                                          0.3899999999999999, ...]},
             scoring=make_scorer(mean_squared_error), verbose=3)
In [73]:
ridge_en = search_ridge.best_estimator_

#get cv results of the best model + confidence intervals
cv_score = cross_val_score(ridge_en, X_train, y_train, cv=tscv, scoring=scorer)
en_perf = en_perf.append({'Model':'Ridge', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
ridge_en
Out[73]:
Pipeline(steps=[('scale', StandardScaler()),
                ('model', Ridge(alpha=0.9899999999999995, max_iter=5000))])
In [74]:
coefs = ridge_en['model'].coef_
ridge_coefs = pd.DataFrame({'Coef': coefs,
                           'Name': list(X_train.columns)})
ridge_coefs["abs"] = ridge_coefs.Coef.apply(np.abs)
ridge_coefs = ridge_coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1)
ridge_coefs
Out[74]:
Coef Name
12 2042.016441 Close_lag_1
13 415.436237 Close_lag_2
15 59.349649 Close_lag_4
14 28.210120 Close_lag_3
9 7.691580 compound_min_lag_2
0 7.462194 compound_mean
7 5.646883 compound_max_lag_2
1 -5.208500 compound_max
10 -2.649817 subjectivity_mean_lag_1
5 -2.106836 compound_mean_lag_2
8 1.471173 compound_min_lag_1
2 -1.411217 compound_min
3 -1.145039 subjectivity_mean
4 -0.809486 compound_mean_lag_1
6 -0.236423 compound_max_lag_1
11 0.020128 subjectivity_mean_lag_2
In [75]:
plotCoef(ridge_en['model'], X_train)
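plotCoef is a small helper defined earlier in the notebook. A minimal sketch of such a helper (plot_coef_sketch is a hypothetical stand-in, assuming it simply bar-plots the fitted coefficients against the training columns):

def plot_coef_sketch(model, X):
    #bar-plot the linear model's coefficients, largest magnitude first
    coefs = pd.Series(model.coef_, index=X.columns)
    coefs.sort_values(key=np.abs, ascending=False).plot(kind='barh', figsize=(8, 6))
    plt.xlabel('Coefficient')
    plt.tight_layout()
    plt.show()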

Random Forest

In [76]:
rf_param = {'model__n_estimators': [10, 100, 300],
            'model__max_depth': [10, 20, 30, 40],
            'model__min_samples_split': [2, 5, 10],
            'model__min_samples_leaf': [1, 2, 3],
            'model__max_features': ["auto", 'sqrt']}
rf = RandomForestRegressor()
pipe = Pipeline([
    ('scale', scaler),
    ('model', rf)])
gridsearch_rf = GridSearchCV(estimator=pipe,
                          param_grid = rf_param,
                          scoring = scorer,
                          cv = tscv,
                          n_jobs=4,
                          verbose=3
                         )
gridsearch_rf.fit(X_train, y_train)
Fitting 10 folds for each of 216 candidates, totalling 2160 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:    9.8s
[Parallel(n_jobs=4)]: Done 132 tasks      | elapsed:   48.9s
[Parallel(n_jobs=4)]: Done 292 tasks      | elapsed:  1.7min
[Parallel(n_jobs=4)]: Done 516 tasks      | elapsed:  2.4min
[Parallel(n_jobs=4)]: Done 804 tasks      | elapsed:  4.2min
[Parallel(n_jobs=4)]: Done 1156 tasks      | elapsed:  5.5min
[Parallel(n_jobs=4)]: Done 1572 tasks      | elapsed:  7.4min
[Parallel(n_jobs=4)]: Done 2052 tasks      | elapsed:  9.7min
[Parallel(n_jobs=4)]: Done 2160 out of 2160 | elapsed: 10.1min finished
Out[76]:
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
             estimator=Pipeline(steps=[('scale', StandardScaler()),
                                       ('model', RandomForestRegressor())]),
             n_jobs=4,
             param_grid={'model__max_depth': [10, 20, 30, 40],
                         'model__max_features': ['auto', 'sqrt'],
                         'model__min_samples_leaf': [1, 2, 3],
                         'model__min_samples_split': [2, 5, 10],
                         'model__n_estimators': [10, 100, 300]},
             scoring=make_scorer(mean_squared_error), verbose=3)
In [77]:
rf_en = gridsearch_rf.best_estimator_

#get cv results of the best model + confidence intervals
cv_score = cross_val_score(rf_en, X_train, y_train, cv=tscv, scoring=scorer)
en_perf = en_perf.append({'Model':'RF', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
rf_en
Out[77]:
Pipeline(steps=[('scale', StandardScaler()),
                ('model',
                 RandomForestRegressor(max_depth=40, max_features='sqrt',
                                       min_samples_leaf=3, min_samples_split=10,
                                       n_estimators=10))])

XGBoost

In [78]:
xgb_param = {'model__lambda': list(np.arange(1,10, 1)), #L2 regularisation
             'model__alpha': list(np.arange(1,10, 1)),  #L1 regularisation
            }
xgb = XGBRegressor(booster='gblinear', feature_selector='shuffle', objective='reg:squarederror')

pipe = Pipeline([
    ('scale', scaler),
    ('model', xgb)])
gridsearch_xgb = GridSearchCV(estimator=pipe,
                          param_grid = xgb_param,
                          scoring = scorer,
                          cv = tscv,
                          n_jobs=4,
                          verbose=3
                         )
gridsearch_xgb.fit(X_train, y_train)
Fitting 10 folds for each of 81 candidates, totalling 810 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  56 tasks      | elapsed:    0.7s
[Parallel(n_jobs=4)]: Done 440 tasks      | elapsed:    6.0s
[Parallel(n_jobs=4)]: Done 810 out of 810 | elapsed:   10.8s finished
Out[78]:
GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=10),
             estimator=Pipeline(steps=[('scale', StandardScaler()),
                                       ('model',
                                        XGBRegressor(base_score=None,
                                                     booster='gblinear',
                                                     colsample_bylevel=None,
                                                     colsample_bynode=None,
                                                     colsample_bytree=None,
                                                     feature_selector='shuffle',
                                                     gamma=None, gpu_id=None,
                                                     importance_type='gain',
                                                     interaction_constraints=None,
                                                     learni...
                                                     monotone_constraints=None,
                                                     n_estimators=100,
                                                     n_jobs=None,
                                                     num_parallel_tree=None,
                                                     random_state=None,
                                                     reg_alpha=None,
                                                     reg_lambda=None,
                                                     scale_pos_weight=None,
                                                     subsample=None,
                                                     tree_method=None,
                                                     validate_parameters=None,
                                                     verbosity=None))]),
             n_jobs=4,
             param_grid={'model__alpha': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                         'model__lambda': [1, 2, 3, 4, 5, 6, 7, 8, 9]},
             scoring=make_scorer(mean_squared_error), verbose=3)
In [79]:
xgb_en = gridsearch_xgb.best_estimator_

#get cv results of the best model + confidence intervals
cv_score = cross_val_score(xgb_en, X_train, y_train, cv=tscv, scoring=scorer)
en_perf = en_perf.append({'Model':'XGB', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}, ignore_index=True)
xgb_en
Out[79]:
Pipeline(steps=[('scale', StandardScaler()),
                ('model',
                 XGBRegressor(alpha=9, base_score=0.5, booster='gblinear',
                              colsample_bylevel=None, colsample_bynode=None,
                              colsample_bytree=None, feature_selector='shuffle',
                              gamma=None, gpu_id=-1, importance_type='gain',
                              interaction_constraints=None, lambda=9,
                              learning_rate=0.5, max_delta_step=None,
                              max_depth=None, min_child_weight=None,
                              missing=nan, monotone_constraints=None,
                              n_estimators=100, n_jobs=0,
                              num_parallel_tree=None, random_state=0,
                              reg_alpha=0, reg_lambda=0, scale_pos_weight=1,
                              subsample=None, tree_method=None,
                              validate_parameters=1, verbosity=None))])

Trying to stack econometric and NLP models

In [80]:
from sklearn.model_selection import cross_val_predict

X_train_stack = pd.DataFrame(columns=['econ_r', 'nlp_r'])
X_train_stack['econ_r'] = cross_val_predict(ridge_e, X_train_e, y_train, cv=10)
X_train_stack['nlp_r'] = cross_val_predict(ridge_n, X_train_n, y_train, cv=10)

X_test_stack = pd.DataFrame(columns=['econ_r', 'nlp_r'])
X_test_stack['econ_r'] = ridge_e.predict(X_test_e)
X_test_stack['nlp_r'] = ridge_n.predict(X_test_n)

X_train_stack.to_csv("Stack_train.csv")
X_test_stack.to_csv("Stack_test.csv")

from sklearn.linear_model import ElasticNetCV
stack = ElasticNetCV(cv=tscv)
stack.fit(X_train_stack, y_train)
cv_score = cross_val_score(stack, X_train_stack, y_train, cv=tscv, scoring=scorer)
stack_performance = {'Model':'Stack', 'MSE':np.mean(cv_score), 'SD':(np.std(cv_score))}
stack_performance

mape(y_test, stack.predict(X_test_stack))
Out[80]:
1.428419912765189
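mape is defined earlier in the notebook; for reference, a standard mean absolute percentage error implementation (returning percent, which is how the ~1.43 above reads most naturally) would be:

def mape(y_true, y_pred):
    #mean absolute percentage error, in percent
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100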
In [81]:
coefs = stack.coef_
ridge_coefs = pd.DataFrame({'Coef': coefs,
                           'Name': list(X_train_stack.columns)})
ridge_coefs["abs"] = ridge_coefs.Coef.apply(np.abs)
ridge_coefs = ridge_coefs.sort_values(by="abs", ascending=False).drop(["abs"], axis=1)
print(ridge_coefs)
plotCoef(stack, X_train_stack)
      Coef    Name
0  0.99867  econ_r
1 -0.00000   nlp_r

Model Comparison

prediction_compare = pd.DataFrame(columns=['y_true', 'econ_r', 'econ_rf', 'econ_x', 'nlp_r', 'nlp_rf', 'nlp_x', 'comb_r', 'comb_rf', 'comb_x', 'stack'])
prediction_compare['y_true'] = y_test
prediction_compare['econ_r'] = ridge_e.predict(X_test_e)
prediction_compare['econ_rf'] = rf_e.predict(X_test_e)
prediction_compare['econ_x'] = xgb_e.predict(X_test_e)
prediction_compare['nlp_r'] = ridge_n.predict(X_test_n)
prediction_compare['nlp_rf'] = rf_n.predict(X_test_n)
prediction_compare['nlp_x'] = xgb_n.predict(X_test_n)
prediction_compare['comb_r'] = ridge_en.predict(X_test)
prediction_compare['comb_rf'] = rf_en.predict(X_test)
prediction_compare['comb_x'] = xgb_en.predict(X_test)
prediction_compare['stack'] = stack.predict(X_test_stack)

prediction_compare.sample(3)
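A simple way to compare the columns (a sketch, reusing the mape helper above) is to score each model's test-set predictions against y_true:

#score every model's test predictions against the true closing values
for col in prediction_compare.columns.drop('y_true'):
    score = mape(prediction_compare['y_true'], prediction_compare[col])
    print(f"{col:>8}: MAPE = {score:.2f}%")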

Outcomes

->The market arguably behaves like a random walk: although its movements show some direction, there is still a great deal of randomness to them.

->The news we used might not be the most relevant. It may have been better to use news relating to the 30 companies that make up the Dow.

->More information could have been included in the model, such as the previous days' price changes or main headlines.

->Many people have worked on this task for years, and companies spend millions of dollars trying to predict the market's movements, so we shouldn't expect too much given the small amount of data we are working with and the simplicity of our model.